Statistical properties of neutral evolution 
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Neutral evolution is the simplest model of molecular evolution and thus it is most amenable to a 
comprehensive theoretical investigation. In this paper, we characterize the statistical properties of 
neutral evolution of proteins under the requirement that the native state remains thermodynamically 
stable, and compare them to the ones of Kimura's model of neutral evolution. Our study is based 
on the Structurally Constrained Neutral (SCN) model which we recently proposed. We show that, 
in the SCN model, the substitution rate decreases as longer time intervals are considered, and 
fluctuates strongly from one branch of the evolutionary tree to another, leading to a non-Poissonian 
statistics for the substitution process. Such strong fluctuations are also due to the fact that neutral 
substitution rates for individual residues are strongly correlated for most residue pairs. Interestingly, 
structurally conserved residues, characterized by a much below average substitution rate, are also 
much less correlated to other residues and evolve in a much more regular way. Our results could 
improve methods aimed at distinguishing between neutral and adaptive substitutions as well as 
methods for computing the expected number of substitutions occurred since the divergence of two 
protein sequences. 
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INTRODUCTION 

The molecular clock hypothesis, proposed by Zuck- 
erkandl and P auling about 40 years ago (Zuckerkandl & 
Pauling 1962 ) , has been a cornerstone in the foundations 
of molecular evolution. In comparing the sequences of 
homologous proteins, Zuckerkandl & Pauling ( 1962 ) ob- 
served that the rate of amino acid substitutions per site 
per year in a given protein family is approximately con- 
stant, independently of the pair of species compared. In 
particular, the substitution rate appeared to depend only 
weakly on the number of individuals in the population 
and on ecological variables, a property which is extremely 
useful for reconstructing phylogenetic trees from the com- 



parison of protein sequences (Saitou & Nei 1987; Thomp- 
son et al. 1994; Grauer & Li 2000). The constancy of the 



substitution rate was shown some years later to be an 
immediate consequence of the neutral theory of molecu- 
lar evolution, propose d inde pendently by Kimura (1968) 
and by King & Jukes (|l969|) . 

The proposal of the neutral theory raised heated con- 
troversies, in part because it challenged the expectation 
that most variations in the genetic material are driven 
by positive natural selection. At the present time, it 
is however generally recognized that both neutral and 
adaptive substitutions play an important role in the evo- 
lution of protein sequences, and methods are being devel- 
oped to distinguish between them. Moreover, the neutral 
paradigm has been extended to inclu de slightly d eleteri- 
ous mutations, as proposed by Ohta (1973, 1976). Mod- 
els of slightly deleterious mutations are however rather 
complicate from the point of view of population genetics, 
and will not be considered here. 

Despite this debate, less attention has been drawn to 
the question whether Kimura's neutral model adequately 
describes the statistical properties of neutral evolution of 
proteins. This question is an important one, since an 
accurate knowledge of such expected properties would 
allow to better distinguish between neutral and adaptive 
substitutions and to improve methods to reconstruct evo- 
lutionary trees from sequence alignments. 

Kimura's model assumes that the overwhelming ma- 
jority of mutations are either effectively neutral (their 
effect on the fitness is much smaller than the inverse 
of the effective population size) or lethal, in the latter 
case purged by negative selection. We will call this as- 
sumption the neutral hypothesis. Moreover, the rate 
of occurrence of neutral mutations is regarded as con- 
stant throughout evolution, independent of the current 
sequence. This is the homogeneity hypothesis. As a con- 
sequence, the rate of neutral substitutions is predicted 
to be constant, and the number of neutral substitutions 
being fixed after k generations of evolution is predicted 
to be a Poissonian variable of mean value ^xk, where 
is the neutral mutation rate. At variance with this, the 



tion distribution appears to be larger than that expected 
for a Poissonian distribution. However, since deviations 
seemed at first to be small, the Poissonian statistics was 
still regarded as a valid first approximation. More accu- 
rate later studies showed that such deviations are in fact 
not small (Langley & Fitch |1973| ; Gillespie [19981 ). This 
discrepancy between the model and empirical observa- 
tions was taken by Gillespie as evidence against the neu- 
tral hypothesis, and he favored the hypothesis that most 
substitutions in protein sequences are fixed by positive 
selecti on (Gillespie 1991). On the other hand, Takahata 
(1987) proposed a modification of the neutral model, the 
fluctuating neutral space model, that can account for the 
non-Poissonian statistics of substitutions, still preserving 
the simplicity of the neutral model as the simplest popu- 
lation genetics model. This is one of the few instances in 
which the debate, instead of concentrating on the neu- 
tral hypothesis, reconsidered the way in which Kimura 
had modeled neutral mutations. 

Recently, we have introduced a model in which the 
neutrality of mutations in protein sequences is explic- 
itly tested by means of a computational model of protein 
folding (BastoUa et al. |1999| , |20004 |2002a| ^002b| ). The 
statistical properties of the model are rather robust: We 
obtained the same qualitative behavior by using two fold- 
ing models as different as a well designed lattice model 
(Abkevich et al. 1994) and a model which allows recogni- 
tion of native protein folds (Bastolla et al. 2001, 2000b). 
In particular, we found that the hypothesis of homogene- 
ity should be rejected and that the neutral mutation rates 
fluctuate broadly during evolution. As a consequence, 
the variance of the substitution process is much larger 
than that expected for a Poissonian distribution, in rea- 
sonable agreement with the statistics observed for most 
protein families. 

Our model belongs to the "structural approach" to 
molecular evolution, which has been made possible by the 
recent advances in the understanding of the dynamics of 
protein folding and the thermodynamics of biomolecules. 
In particular, it is now possible to assess approximately 
the thermodynamic stability of biomolecules by computa- 
tional methods. We have undertaken the challenge of ex- 
ploring the applications of these new methods to the field 
of molecular evolution, where they can complement more 
traditional methods based on population genetics and 
on sequence comparison. The structural approach has 
been pioneered by Schuster and co-workers with a series 
of studies of neutral networks of RNA secondary struc- 



tures (Schuster et al. 1994; Huynen et al. 1996; Fontana 
& Schuster 1998| ) and it has been applied to proteins 
by several other groups (Shakhnovich et al. 1996; Busse- 
maker et al. 1997; Govindarajan & Goldstein 1997; Baba- 
jide et al. 1997; Bornberg-Bauer 1997; Tiana et al. 
1998; Govindarajan & Goldstein 1998; Bornberg-Bauer 



Chan 1999 



first tests performed by Ohta & Kimura (1971) on homol- 
ogous proteins showed that the variance of the substitu- 



Mirny & Shakhnovich 1999; Dokholyan & 



Shakhnovich 2001; Xia & Levitt 2002) 



In this paper, we study the statistical properties of 
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neutral evolution of proteins as they are obtained from 
the Structurally Constrained Neutral (SCN) model that 
we have recently introduced (BastoUa et al. 2002a). Our 
goal is to explore the consequences of these statistical 
properties on molecular evolution. We start the paper 
by reviewing Kimura's model of neutral evolution, mak- 
ing explicit the homogeneity hypothesis and the argu- 
ments that seem to support its validity. In the next sec- 
tion, we review the SCN model, on which our results are 
based. Conservation of the protein fold during evolution 
is discussed in the following section, where we argue that 
fold conservation can be regarded as a consequence of 
the requirement of thermodynamic stability. We then 
investigate the origin of the strong fluctuations in the 
fraction of neutral neighbors that are found within the 
SCN model and that produce the non-Poissonian statis- 
tics of the substitution process. We define the neutral 
connectivities at each position in the protein and study 
their pairwise correlations. We find that even connectiv- 
ities at positions far apart in the structure are strongly 
correlated. Remarkably, the only exceptions to this be- 
havior are represented by structurally conserved posi- 
tions, which are much less correlated with other posi- 
tions and evolve in a more regular fashion. These spatial 
correlations are the counterpart of the temporal correla- 
tions that we described in an earlier work (BastoUa et 
al. 2002b). In the following sections, we present some 
consequences of the broad distribution of neutral rates. 
We show that: (i) The neutral mutation rate is not con- 
stant as a function of the time interval considered, but 
decreases monotonously as larger time intervals are con- 
sidered, (ii) The normalized variance of the substitution 
process increases with the time interval, and is in rea- 
sonable agreement with the data on the dispersion in- 
dex for most proteins, (iii) Fluctuations in the neutral 
connectivities can influence the generation time effect. 
Our simulations are then used to test methods applied 
to estimate the number of substitutions in the diver- 
gence of two sequences. Finally, we discuss the impact of 
our findings on methods used to detect positive selection 
(as fo r inst ance the one proposed by McDonald & Kreit- 
man ( 1991 )), which are usually based on neutral models 
with constant neutral mutation rate. After summarizing 
our results in the Conclusions, the section 'Materials and 
Methods' reports the details of some technical points. 



KIMURA'S NEUTRAL MODEL REVISITED 

As discussed in the Introduction, Kimura's neutral 
model is based on two assumptions. The neutral hy- 
pothesis is equivalent to the assumption that the most 
common mutations in protein sequences are either dis- 
ruptive and eliminated by negative selection, or neu- 
tral, i.e. they leave the protein active and their effect 
on fitness is much smaller than the inverse of the effec- 
tive population size. This mutational spectrum implies 



that protein sequences evolve on a neutral network, i.e. 
a set of sequences where the protein is active and which 
can be connected through point mutations. Fixation of 
slightly deleterious mutations, as well as advantageous 
mutations, are not included in the model. This is of 
course an important limitation of neutral models. 

The second assumption is that the rate of appear- 
ance of neutral mutations is constant throughout evolu- 
tion (homogeneity hypothesis). In a 1977 paper, Kimura 
commented that rate constancy may not hold exactly 
(Kimura [1977 ), but he did not develop the consequences 
of the violation of this hypothesis any further. 

The neutral model predicts that the rate of fixation of 
neutral mutations in an evolving population is equal to 
the rate of their appearance, independently of the popu- 
lation size, because the number of appearing mutations 
is proportional to the population size and the probabil- 
ity of their fixation is inversely proportional to it. Let 
us consider a given lineage evolving for a time interval t. 
The number of mutations appearing in this time inter- 
val is expected to obey a Poissonian statistics with mean 
value /it, where is the mutation rate. We call neutral 
connectivity the probability that one of such mutations 
is neutral, which is proportional to the connectivity of 
the neutral network, and denote it by x. The value of x 
depends on the particular protein family chosen but it is 
assumed to be constant throughout evolution. As a con- 
sequence, the probability V{St = n) of the occurrence of 
n neutral mutations within a time interval t is 



a;"(l-2;)'"-", (1) 



where the factor (™)a:'"(l — x)™~" gives the probability 
that n out of m mutations are neutral. The summation 
can be performed exactly, yielding 



i) — e -— , 



(2) 



so that the number of neutral mutations also obeys a 
Poissonian statistics, with mean value ^xt and rate fix. 

In principle the neutral connectivity x{A) depends on 
the amino acid sequence A considered, but Kimura's 
model assumes that a;(A) is the same for all protein se- 
quences belonging to the same structural and functional 
family. This assumption can be justified with the fol- 
lowing argument: Let Xi(A) be the fraction of possible 
mutations of sequence A at position i which are neutral. 
Then the overall fraction of neutral mutations is just the 
average of this quantity over the N positions in the amino 
acid chain. 



(3) 



If the neutral connectivities at different positions are un- 
correlated, or if only positions which are close in the na- 
tive structure are correlated, then the variance of a;(A) 
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is proportional to \/N ^ which is very small for long pro- 
tein chains, so that the approximation of constant a; is 
expected to be reasonably good. As we shall see in the 
following, this is not the case. 



THE STRUCTURALLY CONSTRAINED 
NEUTRAL MODEL 

In the SCN model we assume the validity of the neu- 
tral hypothesis, but we do not make any assumption re- 
garding the neutral mutation rate. Instead, we estimate 
explicitly the effect of a mutation on protein stability us- 
ing an effective model of protein folding (BastoUa et al. 
2000b ) which provides us with a genotype to phenotype 
mapping. In this respect, the rate of occurrence of neu- 
tral mutations is an outcome of the model. We show 
that this rate displays very broad fluctuations through- 
out evolution. 

The SCN model addresses protein evolution at the level 
of a single sequence, and it does not take into account 
population dynamics. This simplification is justified by 
the fact that, within Kimura's model, the substitution 
rate does not depend on the population size. Neverthe- 
less, population size might influence the evolutionary pro- 
cess if the rate of neutral mutations shows broad fluctua- 
tions, as observed here. The explicit inclusion of popula- 
tion genetics into the model will be needed to investigate 
this interesting possibility. 



Estimating protein stability 

In our model of protein folding, we evaluate effec- 
tive conformational energies (temperature and pH de- 
pendent) using the effective energy function described by 
Bastolla et al. ( ^001 , 2000b), which provides good ther- 
modynamic properties for protein structures in the Pro- 
tein Data Bank (PDB). For each sequence typically sev- 
eral millions alternative conformations of the same length 
are generated by gapless threading. The native confor- 
mation is identified as the lowest energy conformation. 
The energy function has several non-trivial properties: 
(i) It assigns lowest energy to the experimentally known 
native structures of basically all single chain protein se- 
quences; (ii) It provides the native structure with a well 
correlated energy landscape (see below), (iii) For chains 
belonging to oligomeric proteins, for which interchain in- 
teractions are neglected, the experimentally known na- 
tive state shows a deficit of stabilizing energy, well corre- 
lated with the amount of neglected free energy, (iv) The 
effective energy function that we use is able to estimate 
the folding free energy of proteins with two states kinetics 
reasonably well (U. Bastolla, unpublished). 

There are two necessary conditions for thermodynamic 
stability (a) a low folding free energy, i.e. the unfolded 
state is less stable than the native state; (b) stability with 



respect to alternative misfolded states, i.e. the protein 
has a well correlated energy landscape, a property which 
turns out to be crucial in all simple models of protein 
folding. To enforce the first kind of stability, we estimate 
the folding free energy, which, in the simplest approxima- 
tion, is just a linear function of the effective energy of the 
native state. The second condition is estimated through 
two computational parameters: The Z-score (Bowie et al 
1991; Goldstein et al. 1992), which measures the differ- 



ence between the native energy and the average energy in 
units of standard deviation of the energy, and the normal- 



ized energy gap a (Gutin et al. 1995; Bastolla et al. 1999), 



measuring the minimal value of the energy gap between 
an alternative conformation and the target one divided 
by their structural distance. A positive and large value 
of the a parameter ensures both that the target confor- 
mation has lowest energy and that the energy landscape 
is well correlated, in the sense that conformations very 
different from the native one have very high energy. 

We impose stability conservation by requiring that the 
lowest energy state of the model coincides with the ex- 
perimentally known native state and allowing its stability 
parameters, previously described, to be off not more than 
1.5% with respect to the values of the corresponding PDB 
sequence. 



Exploring the neutral network 

Simulations of protein evolution are performed starting 
from a protein sequence in the PDB. Evolution is con- 
strained to viable sequences, which are sequences where 
the native structure is thermodynamically stable. A neu- 
tral network is a set of viable sequences which can be 
connected to the starting sequence through point muta- 
tions passing on other viable sequences. Thus sequences 
on a neutral network share the same protein fold and are 
evolutionarily connected. 

For every amino acid sequence A in the neutral 
network we measure the fraction of neutral neighbors 
x{A) G (0, 1], which is the fraction of possible point mu- 
tations that are viable. Since this number defines the 
connectivity of the neutral network at point A, we shall 
also call x{A) the neutral connectivity of sequence A. In 
the framework of our model, this is the only property of 
a sequence which influences its evolution. 

Subsequently visited sequences belonging to the 
neutral network constitute an evolutionary trajec- 
tory {Ai,A2,...}. To each trajectory is associated 
the list of the corresponding neutral connectivities 
{x(Ai),x(A2),...}. 



Substitution process 

An amino acid substitution is controlled by two in- 
dependent events: A random mutation, described as a 
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Poissonian process as in the previous section, and an ac- 
ceptance process which consists in testing whether the 
sequence is viable. The acceptance probability for a mu- 
tation taking place when the protein is in sequence A 
is given by a; (A). As a result of the fluctuations in the 
neutral connectivities, the resulting substitution process 
is no longer Poissonian. For a given sequence of neu- 
tral connectivities {a;(Ai), a;(A2), . . .} we can compute 
the probability that the number St of accepted muta- 
tions in a time interval t is equal to n. This is just the 
product of the probability (Poissonian) that k mutations 
take place in the time interval t times the probability 
that n of these are accepted. 



oo ( +\k 

P(5t=n) = ^e-''*^Pacc(r^|fc) 

k—n 



(4) 



where the acceptance probability of n mutations out of 
k is given by 



n.+ l 



(5) 
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Here, the {wj} are all integer numbers between zero and 
k — n satisfying X]j=i ™j = k — n. In other words, the 
probability that a mutation is accepted is a;(Ai) as long 
as the protein sequence is Ai, a;(A2) as long as the se- 
quence is A2, and so on. 

Two kinds of random variables have to be distin- 
guished. The first kind of variables represent the mu- 
tation and acceptance process. The average over these 
variables is indicated by angular brackets ( • ), and it 
is still a random variable dependent on the realization 
of the evolutionary trajectory. The average over evolu- 
tionary trajectories is indicated by an overline • . In the 
biological metaphor, the average over mutation and ac- 
ceptance correspond to population averages and the av- 
erage over different evolutionary trajectories correspond 
to averages over independent populations. 

If all sequences have the same fraction of neutral neigh- 
bors a;(A) = X, the number of substitutions in a branch 
of length t is Poissonian with mean fitx and the substi- 
tution rate is equal to fix, as in Kimura's model. If the 
variance of the neutral connectivity is not zero, the sub- 
stitution distribution is more complicated and has to be 
computed numerically using the simulated evolutionary 
trajectories (see the section 'Materials and Methods'). 



nor it depends on the protein structures considered, al- 
though the studied proteins cover a broad spectrum of 
different biological activities. The protein structure only 
determines which positions play a structural role and are 
therefore more conserved than average positions, but the 
properties of structurally conserved residues, like the fact 
that their evolutionary rate is less dependent on the con- 
text of the sequence (see below), are general features for 
all structures. Although some properties that we deter- 
mined may depend on the length of the protein, our data 
is insufficient to quantify such an effect. 

We studied in this work eight protein folds, con- 
sidering for the rubredoxin fold two different struc- 
tures, one of a mesophilic and one of a thermophilic 
species. The nine proteins studied are: myoglobin 
(PDB code la6g), cytochrome c (PDB code 451c), 
lysozyme (PDB code 31zt), ribonuclease (PDB code 
7rsa), rubredoxin (mesophilic species: PDB code liro; 
thermophilic species: PDB code Ibrf _A), ubiquitin (PDB 
code lu9a_A), the TIM barrel (PDB code 7tim_A), and ki- 
nesin (PDB code lbg2). In what follows we describe gen- 
eral properties of the model which are qualitatively the 
same for all proteins studied. Unless otherwise stated, 
the results that we present refer to the myoglobin fold. 

Although the model of protein stability that we use is 
only an approximate one, it is noteworthy that our most 
important qualitative results, the broad distribution of 
neutral connectivities and the overdispersion which is 
caused by correlations along neutral trajectories (see be- 
low) , reproduce those of a former work in which we used 
a model of protein stability in some sense complementary 
to the present one (Bastolla et al. 1999, 2000b). In that 



work the statistical mechanics of protein sequences was 
simulated using a lattice model to represent the ensem- 
ble of conformations. Such a treatment provides only a 
coarse-grained representation of protein structure (e.g., 
secondary structures are not well described). It does, 
however, allow for the enforcement of a rigorous crite- 
rion of stability, that has been then tested through ex- 
tensive Monte Carlo simulations. The fact that quali- 
tatively similar results are recovered using two comple- 
mentary approximations to the protein folding problem 
makes us confident that our model captures some "uni- 
versal" properties of protein evolution. 



SCN and structural conservation 



RESULTS 

An important property that we have shown to hold for 
the SCN model is that its statistical properties are robust: 
Their qualitative behavior does neither depend on the 
thresholds used to select viable sequences (and does even 
not change if lattice models are used instead of effective 
models of protein folding with real protein structures), 



It is commonly observed that protein structures are 
much more conserved than amino acid sequences (Holm 
& Sander 1996 ; Rost 1997 ), but there is no real evolu- 
tionary justification to impose such a rule as we do in 
the SCN model. In fact, the target of natural selection is 
protein function rather than protein structure, and the 
relationship between these two properties is not a simple 
one. It is well known that proteins with the same fold 
can perform quite different functions, and in some cases 
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there is evidence that those proteins are evolutionarily 
related. In fact, proteins often acquire novel functions 
by gene duplication though maintaining a very similar 
structure. Moreover, the case of structurally unrelated 
proteins which perform the same function is not rare, a 
possible result of convergent evolution. 

A more realistic version of the SCN model should im- 
pose conservation of protein thermodynamic stability, ir- 
respective of the native structure. This, in our opinion, 
should be a requirement of any model of molecular evo- 
lution. We expect that such an improved version of the 
SCN model would show that the conservation of the na- 
tive structure is a consequence of the stability require- 
ment. The reason for this expectation is based on our 
simulations of the present SCN model and of a previous 
lattice version of it. In the SCN simulations we record 
the smallest value of the normalized energy gap a with 
the target conformation, over all neighbors of a sequence 
belonging to the neutral network. For seven out of the 
nine proteins that we studied, the smallest value of a 
remained positive for all of the order of 10^ examined 
sequences. Since a positive a means that the target con- 
formation has the lowest energy, imposing that the target 
conformation is stable is effectively equivalent to impos- 
ing that the native (lowest energy) conformation is stable. 
Thus stability requirements alone suffice to guarantee the 
conservation of these protein structures. The two excep- 
tions that we found are cytochrome c and the mcsophilic 
version of rubredoxin, the two smallest proteins that we 
studied. For these proteins some sequences in the neu- 
tral network have neighbors where a structure different 
from the target state has lowest energy. However, this 
is not enough to ensure thermodynamic stability of this 
structure, since we still have to impose that all structures 
unrelated to it have much higher energy, while the energy 
of the target structure is very low. We shall investigate 
this subject in more detail in the future. 

From the above considerations it turns out that im- 
posing a well correlated energy landscape through a con- 
dition on the normalized energy gap makes it difficult to 
change the native structure maintaining the thermody- 
namic stability of the new structure. This result is very 
different from the one obtained in the study of neutral 
networks of RNA. In this case, Schuster and co-workers 
have shown through a computational study that the neu- 
tral networks of two different RNA secondary structures 
can be separated by just one point mutation (Schuster et 
al. 1994). We think that this difference is not only due to 



the difference between protein folding and RNA folding, 
but also to the different model of thermodynamic sta- 
bility applied. Schuster and co-workers require that the 
target structure has the lowest energy, while we impose 
additionally that it has a large folding free energy and a 
large normalized energy gap. 
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FIG. 1: Distribution of the fraction of neutral neighbors for 
the myoglobin fold, shown as P{x) vs x. 



Spatial correlations 

As we have seen, if the neutral connectivities Xi of dif- 
ferent positions are only weakly correlated, the overall 
neutral connectivity x has very small fluctuations and 
the neutral model of Kimura is valid. However, we ob- 
served in a previous work that the connectivity distri- 
bution is quite broad, see Fig. which implies that the 
statistics of the substitution process is very different from 
the Poissonian statistics. To investigate the origin of this 
important property, we examined the correlation matrix 
Cij , whose elements are the correlation coefficients of the 
neutral connectivities at positions i and j (see the section 
'Materials and Methods'). 

Two interesting observations emerge. First, the corre- 
lations coefficients dj are positive and large (larger than 
0.2) for most positions, irrespective of whether they are 
in contact or not in the native structure. Fig. ^ shows 
a comparison between the correlation matrix Cij (upper 
right part of the central plot in the figure, see the sec- 
tion 'Materials and Methods') and the contact matrix of 
the native structure (lower left part of the central 
plot of the figure). Second, there is a strong relation 
between correlation and conservation. Positions which 
are more conserved, as indicated by the neutral network 
average ilxl) of their fraction of neutral neighbors, show 
only a weak correlation with other positions. We plot 
as black dots in the upper right part of Fig. || pairs of 
positions which are only weakly correlated, Cij < 0.3. 
These dots arrange in horizontal and vertical lines at the 
conserved positions. The pattern is more clearly shown 
in Fig. m representing a scatter plot of the variability 
index versus the correlation coefficient Cj between the 
neutral connectivity at position i and the overall neutral 
connectivity (see the section 'Materials and Methods'). 
It can be seen that the two quantities are strongly corre- 
lated. 

Ptitsyn & Ting (1999) showed that there are two 



groups of conserved residues in the globin family. In the 
myoglobin sequence that we considered the first group 
is the heme-binding site, formed by the residues Leu29 
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FIG. 2: Comparison between cross-correlations and contact matrix. We show the contact matrix dj in the lower left part of 
the central plot (a dot indicates the residues i and j are in contact) and the cross-correlation matrix dj in the upper right part 
of the same plot (a black dot means correlations are weak or absent, dj < 0.3, whereas white means dj > 0.3). Above and 
right of the central plot we show the mean fraction of neutral mutations at position i, Tl. Conserved positions (low Tl) show 
the weakest correlations. The plot at the very right shows the histogram of the values obtained for the cross-correlations and 
the 'color code' applied for the central plot. (Figure has reduced file size and quality, original figure upon request.) 



(BIO), Leu32 (B13), Phe33 (B14), Pro37 (C2), Phe43 
(GDI), Phe46 (CD4), Leu61 (E4), His64 (E7), Val68 
(Ell), Leu89 (F4), His93 (F8), Ile99 (FG5), Leul04 (G5), 
and Hisl42 (H19), where the number in parenthesis indi- 
cates the standard notation of Perutz et al. ( 1965| ) that 
specifies the position within an hehx. The second group, 
whose conservation is structural and not functional (Ptit- 
syn & Ting |l999| ), is formed by residues VallO (AS), 
Trpl4 (A12), Ilelll (G12), LeullS (G16), Metl31 (H8), 
and Leul3 (H12). In agreement with our argument, the 
residues in the second group are among those with the 
lowest values of C^, they are ranked 9th, 6th, 22th, 24th, 
1st, and 14th, respectively. On the contrary, residues 
in the functionally conserved group are characterized by 
average values of C^, ranging in rank from from 12 to 
132, with an average rank of 61. Even more interest- 
ingly, residue MetlSl, the one with the lowest C;, is anti- 
correlated (although weakly) with some residues in the 
functionally conserved group. This is the only case of 
significant anti-correlation. Leu69 and Leu76 are ranked 
5th and 2nd, respectively. They have nearly zero cor- 
relation among themselves and with residues in the two 
groups. Interestingly, they are located between the two 
groups in the native structure. Residues Hisll9 and 
Phel23 are ranked 3rd and 4th. They are in the loop be- 
tween helices G and H and they are probably important 
to stabilize the second group of conserved residues. A 
similar observation holds for residues Vall3 and Alal34, 
ranked 7th and 8th, respectively, since they are neigh- 



bors of residues Trpl4 and LeullS. Taken together this 
data suggests that the analysis of cross-correlations com- 
plements Ptitsyn's conservation analysi s of s tructurally 
important residues (Shakhnovich et al. 1996; Ptitsyn & 
Ting 119991 ). 



Temporal correlations 

To investigate the temporal correlations, we measured 
the auto-correlation function 

C(x(AO,^(A.+J) = lf;^(^^M^±±^il^, (6) 

fc=i 

where an average is taken over all pairs of sequences con- 
nected through an evolutionary trajectory of exactly n 
substitutions. The corresponding plot can be seen in 
Fig. ^. The auto-correlation function starts with the 
value one a,t t — Q and then decreases as subsequent sub- 
stitutions make the sequences less and less correlated. 

The characteristic time after which the correlation 
function reduces of a factor 1 /e can be estimated through 
an exponential fit of the first part of the correlation func- 
tion. For all proteins examined, the characteristic length 
is two or three substitutions. Despite this fast decay, 
temporal correlations are responsible of many interesting 
properties of the model, from the large deviation of the 
distribution of the number of substitutions with respect 
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FIG. 3: Comparison between cross-correlations and conser- 
vation. We show the mean fraction of neutral mutations at 
position i, 'xl, and the correlation between a and the overall 
neutral connectivity x, d. In the upper part of the figure, 
the dashed horizontal lines indicate one standard deviation 
from the mean (indicated by the full horizontal lines) . The ar- 
rows at the bottom indicate the position i for which the mean 
fraction of neutral neighbors aJI and the cross-correlation d 
are below the threshold (1.5 standard deviations) shown as 
full horizontal lines. In the lower part of the figure, the two 
quantities are plotted against each other (the horizontal and 
vertical lines have the same meaning as in the upper plots). 
The residues which are above the threshold for both quan- 
tities are shown as full circles, the residues which are below 
the threshold for both quantities are shown as full squares, 
whereas the residues which are above the first threshold but 
below the second, or vice versa, are shown as open circles. 
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FIG. 4: Auto-correlation function C{x{Ai),x{Ai+n)) of neu- 
tral connectivities at sequences separated by n substitutions. 



rameters are very high will be much more tolerant to 
mutations than an average sequence: In this case, every 
positions will have a high neutral connectivity, since most 
amino acid changes at the position will maintain the pro- 
tein viable. The only exceptions will be positions which 
perform a key structural role, which, as we have seen, 
are characterized by a much smaller neutral connectivity 
and are much less correlated with the neutral connectiv- 
ity of the whole protein. The temporal correlations can 
be explained exactly in the same way: Since the ther- 
modynamic properties change smoothly with changes in 
the sequences, they will remain very correlated after few 
changes in the sequences, so that also the neutral con- 
nectivities which derive from these stability parameters 
will show strong correlations. 

To support our argument, we show results from a sim- 
ulation where just one thermodynamic parameter has 
been considered to select sequences belonging to the neu- 
tral network. In the additional simulations presented in 
Fig. ^(a), only the native energy per residue has been 
used as a selection parameter. Accordingly, there is a 
very strong relationship between this thermodynamic pa- 
rameter and the neutral connectivity a; (A). In Fig. ^(b), 
both the normalized energy gap and the native energy 
have been used as selection parameters. Thus the rela- 
tionship between thermodynamic parameters and neutral 
connectivity is less strong, but still it is clearly present. 



to a Poissonian one to large variations of the neutral sub- 
stitution rates in different evolutionary trajectories (see 
also (Bastolla et al. ^002b[ )). 



Origin of the correlations 

The phenomenon that originates the spatial and tem- 
poral correlations that we have observed in the SCN 
model can be described as follows. We impose in our 
model that some stability indices are above some pre- 
defined thresholds in order to decide that a sequence is 
viable. Thus, a sequence where all of the stability pa- 



Multiple substitutions 

According to the molecular clock hypothesis, the num- 
ber of substitutions in a time interval t is proportional 
to the duration of the time interval, with the addition of 
random fluctuations, whose standard deviation is small 
with respect to the mean when t is large. However, in the 
comparison of two protein sequences, only the number of 
sites occupied by different amino acids, N^, is accessi- 
ble to measurement. From this quantity, the number of 
substitutions has to be calculated assuming some model 
of protein evolution to correct for possible occurrence of 
multiple substitutions at the same site. 
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FIG. 5: Dependence of the fraction of neutral neighbors x 
on the native energy per residue E/N (in arbitrary units), 
(a) when only the native energy per residue has been used as 
a selection parameter, and (b) when this parameter is used 
together with the normalized energy gap as a selection pa- 
rameter. (Figure has reduced file size and quality, original 
figure upon request.) 




FIG. 6: Distribution of the average substitution rate Xi across 
different positions. 



The simplest such meth od was used already by Zuck- 

Assuming that all sites in a 



erkandl & Pauling (gJ62|). 
protein evolve at the same rate, and that the number 
of substitutions for each site follows a Poissonian distri- 
bution of average value K , the probability that one site 
did not change is exp(— iiT). This probability can be es- 
timated by the fraction of unchanged sites, 1 — N^/N, 
so that the expected number of substitutions per site K 
can be estimated as 



K 



1 ( 1 



(7) 



The assumption that the substitution rate is the same 
at all sites is stronger than the homogeneity assump- 
tion stating that the substitution rate is the same for 



all sequences. It is well known that functionally impor- 
tant residues change very rarely during evolution, often 
as the result of a change in the environment or a switch in 
the protein function, while other residues are much more 
free to mutate. Even within our model, which does not 
take into account functional constraints explicitly, some 
residues are much more difficult to mutate than others, 
since they are structurally important and belong for in- 
stance to the so-called folding nucleus of the protein. In 
Fig. ^ we show the distribution of the average fraction of 
substitutions at position i which are neutral, for all 151 

positions in the myoglobin fold. 

As noticed by Nei & Kumar (^000|) and by Kimura 
( 1983 ), the Poissonian correction Eq. (Q) remains valid 
also in case of varying evolutionary rates for small values 
of the fraction of changed positions, N^/N. Otherwise, 
one has to consider more sophisticated corrections. The 
most common method which takes into account variation 
of rate across sites consists in assuming that the rates are 
distributed according to a gamma distribution, which is 
the product of an exponential times a power law distri- 
bution (Uzzell & Corbin 1971). It is then possible to 
compute analytically the probability that a site has not 
changed and, equating it to the observed frequency, to 
obtain the average number of substitutions per site as 

-l/a 

1-^1 - 1 



K = a 



N 



(8) 



where a is the shape parameter of the gamma distribu- 
tion (Nei & Kumar [2000| ). The above results, Eq. (0), is 
recovered in the limit a — > oo, as in this limit the gamma 
distribution becomes a delta distribution with vanishing 
variance. However, in realistic situations, a tends to be 
small (typically smaller than four), so that the distribu- 
tion of rates across different positions is broad and the 
estimated number of substitutions is much larger than 
the one obtained using Eq. 

In typical studies, the parameter a is first estimated, 
for instance by maximum parsimony, and then used to 
obtain the number of substitutions through Eq. (^) . Our 
simulations yield the number of substitutions for a given 
fraction of mutated positions, Nd/N, so that, using our 
data, we can directly test the vahdity of Eq. (||). 

Results of such an analysis are shown in Fig. 0. From 
Fig. 0(a) it is clear that two regimes are encountered. 
The first regime (large sequence similarity characterized 
by small N^/N) will be called the transient regime. In 
this case the average sequence similarity decreases on the 
average with time, measured as the number of substi- 
tutions, and the standard deviation is small. We can 
thus use the measured similarity to estimate the number 
of substitutions. This regime is illustrated in Fig. 0(b), 
where it is compared with various kinds of corrections for 
multiple substitutions. As it can been seen, the gamma 
corrections with small a fits the data better than the 
Poissonian correction (a = oo), although systematic de- 
viations are appreciable. Only for lysozyme we find an 
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FIG. 7: Average number of substitutions K &s & function of 
the logarithm of the fraction of residues which are the same 
as in the initial sequence, — ln(l — Nd/N). (a) The transient 
regime and the stationary regime are shown, (b) Only the 
transient regime is shown, together with the analytical cor- 
rections for multiple substitutions for various values of the 
shape parameter a, Eq. (H). The value a = oo corresponds to 
Poissonian corrections. 



optimal value of a close to eight, whereas for all other 
proteins the optimal value, although not very precisely 
determined, lies in the range from one to five. For low 
sequence similarities, it is not allowed to neglect the prob- 
ability p Pi 1/20 that a further mutation of a mutated site 
restores the amino acid initially present in the sequence. 
Taking this into account, Eq. (|^) has to be modified to 



K = \n{l-p- pK) - In ( 1 - — - p 



N 



which, for small p, yields 
if = -(1 -p)ln ( 1 - 



Nil-p), 

Similarly, Eq. (H) has to be modified to 



(9) 



(10) 



1-p- 



N 



= 1 



1-p-p- 



aK 



(11) 



a + K, 

which for small p can be approximatly solved to yield 



K « a(l -p) 



1 



-l/a 



N{l-P), 

The previous formulas are recovered for p 



1 



■ (12) 
0, while 




FIG. 8: Minimal sequence similarity above which the num- 
ber of substitutions can be estimated, 1 - N^""^ /N vs N~^^^, 
where A'^ is protein sequence length. The data refers to tra- 
jectories with KN = 20000 substitutions. The line shows a 
linear fit. 



Eq. (|l^) is recovered as the infinite a limit of the above 
equations. Notice that, for non-zero p, the expected num- 
ber of substitutions now diverges when the sequence sim- 
ilarity 1 — Nd/N tends to p, and can not be computed 
anymore for smaller sequence similarity. 

The second regime is called the stationary regime 
(right part of Fig. 0(a)). In this case, sequence similar- 
ity has reached a stationary distribution with respect to 
the number of substitutions, and it docs not give us any 
information about the number of substitutions occurred 
in the evolutionary history. Clearly, in this case we can- 
not anymore identify the probability that a site has not 
changed, exp(—K), with the frequency of identical sites, 
1 — Nd/N, neglecting the fluctuations of the latter quan- 
tity, and consequently the above formulas loose their va- 
lidity. The value of sequence similarity below which the 
transition to the stationary regime takes place depends 
on the sequence length N and, more weakly, on the num- 
ber of substitutions, KN, and is such that the station- 
ary probability to observe a sequence similarity equal to 
Nd/N is roughly equal to 1/KN. In Fig. | we plot the 
threshold of sequence similarity below which the stan- 
dard deviation in the number of substitutions becomes 
of the same order of the average number of substitutions, 
and hence estimating the number of substitutions looses 
all meaning. Notice that, if some positions are conserved 
because of functional or other constraints, this thresh- 
old of similarity will be even higher. The data has been 
obtained from our simulations with KN — 20000 sub- 
stitutions for all nine proteins. The plot clearly shows 
the expected dependence of the threshold on the inverse 
square root of sequence length. 



Time dependence of the substitution rate 

In our analysis, two kinds of averages must be distin- 
guished. We indicate by angular brackets ( • ) the average 
over the mutation and acceptance process for a given real- 
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FIG. 9: Dependence of the average substitution rate on the 
time interval considered. Shown is the average number of 
substitutions (5t) divided by /it. We present the statistics 
for the trajectories obtained through the SON model (full cir- 
cles) and the same for the corresponding annealed trajectories 
(open circles). The horizontal lines indicate a; and (l/a;)"^, 
respectively. 



ization of the evolutionary trajectory, and by an overline 
• the average over evolutionary trajectories. We deter- 
mine the mean number of substitutions ( 5*4 ) within a 
time interval t, and show this quantity in Fig. 0. We also 
show in the same figure results based on annealed tra- 
jectories, obtained by extracting at random the values of 
X at each substitution event according to the observed 
distribution P{x) (cf. Fig. |l|). In this case, the different 
X along the trajectories are independent variables. Note 
that the annealed trajectories 'interpolate' between the 
Poissonian case (all x are equal) and the correlated tra- 
jectories obtained through the SCN model: (i) The com- 
parison between the annealed trajectories and the sim- 
ple Poissonian case allows us to identify the effect of the 
broad distribution of the number of neutral neighbors, 
whereas (ii) the comparison between the actual and the 
annealed trajectories allows us to identify the effect of 
correlations. 

As a result of the broad distribution of connectivities, 
the substitution rate {St)/t is not constant, as in the 
standard model of neutral evolution, but decreases as 
the time interval t gets longer, as shown in Fig. || for 
both correlated and annealed trajectories. This can be 
qualitatively understood by the following consideration: 
In the annealed case, the time spans t between subse- 
quent substitutions arc independent variables distributed 
with the density D{t) — P(x) (/ia;)~^ exp(— /ixr) da;, 
whose average value is r = /p P{x) ifj.x)-'^ dx. Thus, the 

average substitution rate (^St^/t is not constant in time 
as in the Poissonian case. Initially, St is a Poissonian 
variable with average rate ^x. At large time, however, 
the rate converges to the smaller value {St)/t « 1/r (i.e., 

{St) /{fJ't) w {l/x)~^), since the process spends more and 
more time in sequences with small x(A). 



Variances 

The variance of the substitution process has been in- 
tensively studied in the first tests of the neutral theory. 
In those tests it was expected that the mean and the vari- 
ance of the number of substitutions should be equal un- 
der neutral evolution, as the latter was expected to follow 
Poissonian statistics. The ratio between the variance and 
the mean, R = V{St)/E{St), where V indicates variance 
and E indicates expectation value, is called dispersion in- 
dex. Deviations of this quantity from unity indicate devi- 
ations from Poissonian statistics. In the framework of the 
SCN model, the broad distribution of neutral connectiv- 
ities causes the substitution process to be overdispersed 
(its variance is larger than the mean), as we have shown 
previously (BastoUa et 



2002b). Here we investigate 



the relative contributions of the broad distribution and 
of the time correlations to the variance of the substitution 
process. 

The variance of the substitution process can be decom- 
posed in two components, one calculated for a fixed evo- 
lutionary trajectory (fixed population), the other taking 
into account the variance of different evolutionary trajec- 
tories. 



YiSt)^Y^iSt)+Yx{St) 



{(S?) - {S.r) + [{Sty - (St)") 



(13) 



The first term, V^, is the variance of the mutation and 
acceptance process for a fixed trajectory, averaged over 
evolutionary trajectories. The second term, V^;, is the 
variance of the substitution rate with respect to differ- 
ent evolutionary trajectories. This term, which is not 
present in the standard neutral model, is responsible of 
the fact that the variance of the number of substitutions 
is typically larger than its mean value, in contrast with 
a Poissonian process. 

We shall denote by i?^ the normalized mutational 
variance, i?^(S't) — V^(5't)/E(5't), and by R^ the cor- 
responding normalized trajectory variance, Rx{St) — 
Va;(S't)/E(iS't). Thus, in the standard neutral model, one 
expects that R^^ = 1 and Rx = 0, independent of the 
time interval t. 

The results of the substitution process based on sim- 
ulated trajectories are shown in Fig. |lo| as full symbols. 
We also show as open symbols results based on annealed 
trajectories, which have the same distribution of neutral 
connectivities but lack the correlations. We start de- 
scribing the latter results. As discussed in the previous 
section, for very short time interval the substitution dis- 
tribution is practically Poissonian and one has R^ ~ 1 
and Rx ~ 0. As the time interval gets longer, the two 
variances grow only moderately: The mutation variance 
i?p apparently reaches very soon a stationary value which 
is just slightly larger than one for all of the proteins that 
we studied, while the trajectory variance Rx has not yet 
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FIG. 10: Variances of the substitution process as a function of 
the time interval considered. We show the normahzed muta- 
tion variance (squares), the normahzed trajectory variance 
Rx (triangles), and the normalized total variance R — R^~\~ Rx 
(diamonds). We present the statistics for the trajectories ob- 
tained through the SCN model (full symbols) and the same 
for the corresponding annealed trajectories (open symbols). 
Notice that annealed trajectories, with a broad distribution 
of neutral connectivities but lacking temporal correlations, 
have a total dispersion index only slightly larger than one, 
while simulated trajectories have a total dispersion index of 
the order of two already after some tens of substitutions. 



FIG. 11: Substitution processes (the average substitution 
{St}) obtained from three different evolutionary trajectories. 
The horizontal lines indicate x and (l/x)~^, respectively. 



^jY{St)- Thus fluctuations of the number of neutral sub- 
stitutions do not hide completely the pattern arising from 
the evolutionary relationships. 



Lineage effects and the generation time effect 



attained a stationary value when the number of muta- 
tion events is of the same order as the protein length, 
but still remains very small, of the order of the ratio be- 
tween the variance and the average value of the neutral 
connectivities x. 

Comparing the two ensembles of trajectories, we note 
that the presence of correlations has only a weak effect 
on the normalized mutation variance R^, which remains 
very close to the value met for annealed trajectories. 
However, the normalized trajectory variance Rx increases 
considerably in response to the correlations, as Rx ~ 1 
for jit — 200. Surprisingly, Rx even grows with time, 
although more and more sequences are used to compute 
the mutational averages and one could expect that such 
averages approach typical values. Thus, surprisingly the 
large fluctuations between different trajectories caused by 
the strong correlations result in a peculiar phenomenon: 
Even averaging over an arbitrary long trajectory does not 
give values representative of typical trajectories. 

The total dispersion index R = Rx + Rf^ calculated 
for the trajectories obtained through the SCN model, 
becomes larger than 2 already for jjd « 100, in quali- 
tative agreement with Gillespie's results for a large set of 
proteins, most of which yielded dispersion indices in the 



range 1 < i? < 5 (Gillespie 1991). Hence, these large dis- 
persion indices observed in protein evolution may be to a 
sizeable extent due to the correlations present in neutral 
trajectories in sequence space. 

Notice however that, although the dispersion index is 
larger than one, still the main property of a Poissonian 
molecular clock is preserved, since, when the number of 
substitutions is large, the average number of substitu- 
tions {St) is much larger than the standard deviation 



The results presented in the previous sections imply 
the values of x along one given evolutionary trajectory 
are very similar for several steps, because of the strong 
temporal correlations we identify. Hence, we will observe 
trajectories characterized by large substitution rates and 
trajectories characterized by small substitution rates, as 
a result of the broad and correlated fluctuations in the 
neutral connectivities. As an example, we show in Fig. |ll| 
the substitution rates obtained from three different evolu- 
tionary trajectories. Note that the variance of the neutral 
rates {St)/t across different evolutionary trajectories de- 
creases with the time interval t, but rather slowly, so that 
even after a sizeable time interval the substitution rates 
of the three trajectories are still significantly different. 

Since different trajectories can be interpreted as differ- 
ent species, different substitution rates in different trajec- 
tories can simulate lineage effects, i.e. variations of the 
substitution rate among different taxonomic groups. One 
such effect is the generation time effect: Since the natural 
time unit for measuring substitution events is the genera- 
tion, species with longer generation time are expected to 
show a slower substitution rate. This prediction has been 
verified comparing for insta nce su bstitutio n rate s in ro- 
dents and primates (Britten 1986 ; Li et al. 1987 Grauer 
& Li 2000 ). Interestingly, the effect is significantly larger 
for synonymous substitutions (DNA changes which still 
code for the same amino acid) than for non-synonymous 
ones. Non-synonymous substitutions are superimposed 
with the large and correlated fluctuations in the sub- 
stitution rate that we just described, and which could 
be strong enough to obscure the generation time effect. 
Thus, the statistics of neutral substitutions observed here 
could explain the quantitative difference between synony- 
mous and non-synonymous substitutions. 



13 



Tests of neutrality 

A better understanding of the statistical properties of 
neutral evolution can help to single out from a back- 
ground of neutral substitutions the more interesting cases 
of positive selection as, for instance, changes in the pro- 
tein function and responses to variations of the environ- 
ment. 

The best current bioinformatics method to identify 
such cases of positive selection is the one proposed by 



ber of substitutions taking place in an evolutionary tra- 
jectory whose length t is double of the time spent since 
the last common ancestor of the two populations, 



McDonald & Kreitman (1991), recently used with some 
modifications to study the evolution of Drosophila genes 
(Fay et al. |2002| ; Smith & Eyre- Walker |2002D . This 
method assumes implicitly that the neutral substitution 
process has a constant rate ^x. Thus, the broad and cor- 
related fluctuations in the neutral connectivities should 
be taken into account to provide a better null hypothesis 
to compare with. 

McDonald and Kreitman's method examines two sam- 
ples of sequences from two related populations and dis- 
tinguishes two classes of nucleotide differences in pro- 
tein coding genes: intra-population differences (polymor- 
phisms) and positions which are fixed in any two popula- 
tions but differ between the two (fixed differences). Each 
class is further divided into synonymous (S) and non- 
synonymous (A) differences. Let us denote by Dps and 
DpA the number of synonymous and non-synonymous 
polymorphisms, respectively, and by Dps and DpA the 
number of synonymous and non-synonymous fixed differ- 
ences, respectively. Under the neutral hypothesis, poly- 
morphisms and substitutions are just the two faces of the 
same coin. Let us suppose that all synonymous mutations 
are neutral, and that all deleterious mutations are quickly 
removed from the population and do not show as poly- 
morphisms. Under this hypothesis, the ratio Dpa/Dps 
of non-synonymous to synonymous polymorphisms can 
be interpreted as the fraction of neutral mutations, x, in 
the two populations. Analogously, Dfa/D^s can be in- 
terpreted to represent the fraction of neutral mutations, 
X, over an evolutionary trajectory. McDonald and Kre- 
itman assume that the two quantities are equal under 
neutral evolution. Thus, in their method, the neutral 
expectation is that Dpa/Dps and Dfa/-Dfs are equal. 
Conversely, if the latter ratio is significantly larger than 
the former, this is interpreted as an evidence that some 
of the non-synonymous substitutions have been fixed by 
positive selection, since in this case fixation is much faster 
than for neutral mutations, and a larger number of sub- 
stitutions can be fixed if they bring any selective advan- 
tage. 

Let us now see how the neutral expectation must be 
modified taking into account the statistics of neutral sub- 
stitutions. The ratio of non-synonymous to synonymous 
polymorphisms in two populations can be interpreted as 
the fraction of neutral mutations in the two populations, 
x(Ai) and x{A2) respectively. On the other hand, the 
ratio Dfa/Dfs should be interpreted as the average num- 



Dfa (St) 



Dps 



fit 



(14) 



where the average is the mutational average along the 
evolutionary trajectory which connects the two wild-type 
sequences Ai and A2. Now, if the number of substitu- 
tions between these two sequences is small with respect 
to the "correlation time" after which values of x{A) be- 
come uncorrelated, then we expect that the effective frac- 
tion of neutral mutations represented in Eq. (Q) is close 
to both x(Ai) and a;(A2), and the neutral expectation 
of McDonald and Kreitman remains valid. Otherwise, 
we expect that Dpa/Dps is smaller than the larger be- 
tween the two neutral fractions: As shown in Fig. ^, the 
effective substitution rate decreases as longer time inter- 
vals are considered, because the evolutionary process gets 
trapped in sequences with small neutral connectivity. As 
a rule of thumb, if x{Ai) and x(A2) are very similar and 
the number of fixed substitutions DpA is of the order of 
10, then the neutral expectation of the McDonald and 
Kreitman method is likely to be valid, otherwise correc- 
tions should be considered. 



CONCLUSIONS AND PERSPECTIVES 

In this work we have proven that the statistical proper- 
ties of neutral substitutions simulated through the SCN 
model, which imposes conservation of the thermody- 
namic stability of the protein, differ considerably from 
those predicted by the standard Kimura's model, which 
assumes that the fraction of neutral mutations is con- 
stant throughout evolution. These differences have im- 
portant consequences for several aspects of neutral evolu- 
tion. Only taking into account accurate statistical prop- 
erties of a neutral model allows to test the neutral hy- 
pothesis as the null hypothesis of molecular evolution, 
against which adaptive substitutions have to be distin- 
guished. 

We have shown that the differences between the stan- 
dard neutral model and the SCN model arise because 
of correlations: (i) There are strong spatial correlations 
between the neutral connectivities of different positions 
along the protein chain, even if the corresponding amino 
acids are not close in space. As a consequence, the neu- 
tral connectivities of the entire sequences are broadly dis- 
tributed, in contrast with the standard model, (ii) There 
are strong temporal correlations between the neutral con- 
nectivities of sequences nearby along a neutral trajectory. 
As a consequence, the substitution rates of different evo- 
lutionary trajectories, even averaged over a sizeable num- 
ber of substitutions, can be significantly different. 

We have illustrated a number of important conse- 
quences of these properties: (i) As a result of the broad 
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distribution, the substitution rate is a decreasing function 
of the time interval considered, instead of being constant, 
and attains the stationary value ^{l/x)~^ starting from 
the larger value nx at very small time intervals, (ii) As 
a result of the auto-correlations, even long evolution- 
ary trajectories, corresponding to different species, can 
be characterized by rather different neutral substitutions 
rates. This fact can produce new lineage effects which 
may obscure the generation time effect, and it should be 
taken into account when testing the predictions of the 
neutral hypothesis, (iii) As a result of the broad vari- 
ance of different evolutionary trajectories, the variance 
of the substitution process is not constant in time, and 
it is larger than that expected for a Poissonian process. 
Nevertheless, for large evolutionary separations the stan- 
dard deviation of the number of substitutions is much 
smaller than the mean value, as for a Poissonian process, 
so that neutral substitutions can still be useful for recon- 
structing phylogenetic trees, (iv) Moreover, our simula- 
tions provide useful data to test methods for estimating 
the number of substitutions between two diverging se- 
quences from their sequence similarity. Our results allow 
to estimate the threshold below which sequence similar- 
ity does not provide any information on the number of 
substitutions, and also show that the probability that the 
same amino acid arises as the result of two independent 
substitutions should be taken into account at low simi- 
larity. The neutral model does not make detailed pre- 
dictions about this point, since one still needs to know 
the neutral mutation rates Xi at different positions. It is 
important to note that Kimura's model does not assume 
that such rates are all equal. Nevertheless, our simula- 
tions produce realistic K vs N^/N curves which can be 
used to test correction methods. 

The structural approach to protein evolution has only 
just been initiated. Natural extensions of our model that 
will be considered in forthcoming work include considera- 
tion of the genetic code as well as modeling of the popula- 
tion genetics level and of possible fitness effects of changes 
in the thermodynamic parameters of the protein. We an- 
ticipate that the structural approach will provide key in- 
sights into different areas of molecular sciences, including 
directed evolution of enzymes, rational sequence design, 
phylogenetic reconstruction, protein fold prediction, and 
production of new materials for bio-nanotechnology. 



MATERIALS AND METHODS 
Protein model 

We represent a protein structure by its contact matrix 
Cij, where Cij = 1 if residues i and j are in contact 
and Cij = otherwise. Two residues are considered in 
contact if any two of their heavy atoms are closer than 
4.5 A. The effective free energy associated to a sequence 
of amino acids A in the configuration C is approximated 



as a sum of pairwise contact interactions, 
E{A,C)^Y.a,U{A,,A,), 

i<j 



(15) 



where Ai labels one of the twenty amino acid types and 
U{-,-) is a 20 X 20 symmetric interaction matrix. Here 



we use the matrix derived by BastoUa et al. (2000b), 
which describes accurately the thermodynamic stability 
of a large set of monomeric proteins (Bastolla et al. 2001). 

Three remarks are needed: First, the effective energy 
parameters implicitly take into account the effect of the 
solvent and depend on temperature. They express free 
energies rather than energies. Second, the effective en- 
ergy of a structure is defined with respect to a com- 
pletely extended reference structure where no contacts 
are formed and which sets the zero of the energy scale. 
Third, one can derive from the database not the pa- 
rameters U themselves but the dimensionless quantities 
U/{k^T). It is thus important to use dimensionless pa- 
rameters to evaluate the stability of the protein model. 



Candidate structures 

We generate candidate structures for a protein se- 
quence of N residues by generating all possible gapless 
alignments of the sequence with structures in the PDB. 
This procedure is called threading. In this way, we gener- 
ate many, typically of the order of 10^, protein-like struc- 
tures per sequence. In the present context, threading is 
directly used to produce the contact maps of the candi- 
date structures. In order to speed up the computations, 
we use a non-redundant subset of the PDB excluding pro- 
teins with homologous sequences, selected by Hobohm & 
Sander ([l99^). 



The folding parameter a 

For a given sequence A, the energy landscape is well 
correlated if all configurations of low energy are very 
similar to the configuration of minimal effective energy, 
C*(A). Structure similarity is measured by the over- 
lap ^(C, C*), counting the number of contacts that two 
structures have in common and normalizing it through 
the maximal number of contacts, so that q is comprised 
between zero and one. In a well correlated energy land- 
scape, the inequality 



i;(A,C) -£;(A,C*) 
\E{A,C*)\ 



>a(A)(l-g(C,C*)) (16) 



holds, stating that the energy gap between each alterna- 
tive structure C and the ground state C*, measured in 
units of the ground state energy, is larger than a quantity 
a{A) times the structural distance 1 — g(C, C*). The di- 
mensionless quantity q;(A), which is the largest quantity 
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for which the above inequahty holds, can be used to eval- 
uate the folding properties of sequence A. For random 
sequences, many different configurations have quite sim- 
ilar energy and hence a (A) 0. In this case the energy 
landscape is rugged, the folding kinetics is very slow, and 
the thermodynamic stability with respect to variations in 
the solvent is very low. In contrast, computer simulations 
of well designed sequences have shown that, when a(A) 
is finite, the folding kinetics is fast and the stability with 
respect to changes in the energy parameters as well as 
mutations in the sequence is very high. 

Our algorithm computes the parameter a(A) for a 
fixed target configuration C* and a large number of se- 
quences A. We thus indicate this parameter as a(A, C*), 
since we do not know a priori that C* has lowest energy. 
Notice however that, if a(A, C*) is positive, all alterna- 
tive structures have higher energy than C*. Wc impose 
that a(A, C*) is larger than a positive threshold athr for 
sequences A belonging to the neutral network. 



The Z-score 



The Z -score Z(A, C*) (Bowie et al \L99l\ Goldstein et 



al. 1992 ) is a measure of the compatibility between a se- 
quence A and a structure C*, widely used in structure 
prediction. It depends on an effective energy function, 
and measures the difference between the energy of se- 
quence A in configuration C* and its average energy in 
a set of alternative configurations, {C}, in units of the 
standard deviation of the energy. 



Z{A,C*) 



(i?(A,C)2)^-(i?(A,C))^ 



(17) 



When sequence A folds in structure C*, its correspond- 
ing Z-score is very negative. 

Given the above definition, one has still to specify how 
to select alternative structures. A possibility, often used 



for lattice models (Mirny & Shakhnovich 1996) is to as- 



sume that alternative structures are maximally compact, 
randomly chosen structures, whose average energy can be 
estimated as (£^(A, C))^-, = iVci„ax(e(-A-)). Here, iVcmax 
is the maximal number of contacts of candidate struc- 
tures and (e(A)) is the average energy of a contact, av- 
eraged over all possible contacts formed by sequence A. 
This leads to introduce the parameter 



Z'{A,C*) 



E{A,C*)/Nc„ 



(e(A)) 



^{e^A))-{e{A)y 



(18) 



The use of Z' has two main advantages: (i) It makes the 
value of the Z-scoie much less sensitive to chain length N 
and to the particular set of alternative structures used, 
(ii) The evaluation of Z' is much faster than that of the 
Z-score. This is necessary in order to explore efficiently 
sequence space. We impose that — Z'(A, C*) is larger 



than a positive threshold —Zthi for sequences A belong- 
ing to the neutral network. 



Sampling the neutral network 

Our algorithm explores the neutral network of a given 
protein starting from its PDB sequence Ai and iterating 
the following procedure: At iteration i, (i) the number 
X{Ai) of viable neighbors of sequence A^ is computed, 
and (ii) the sequence A^+i is extracted at random among 
all the viable neighbors of A^ . In this way we generate a 
stochastic process along the neutral network which sim- 
ulates neutral evolution and looses memory of the initial 
sequence very fast. 

Sequence A is regarded as viable if both parame- 
ters a(A, C*) and —Z'{A,C*) are above predetermined 
thresholds, chosen as 98.5% of the values of those param- 
eters for the sequence in the PDB. This enforces conser- 
vation of the thermodynamic stability and folding capa- 
bility of the native structure C*. We verified that the ob- 
served behavior does not change qualitatively for thresh- 
olds between 95% and about 100% of the PDB values. 

We impose strict conservation of the cysteine residues 
in the PDB sequence, and do not allow any residue to mu- 
tate to cysteine, since a mutation changing the number 
of cysteine residues by one would leave the protein with 
a very reactive unpaired cysteine that would probably 
affect its functionality. Accordingly, the maximal possi- 
ble number of neighbors tested is X^nax — 18{N — Ncys), 
where N is the number of residues and A'cys is the number 
of cysteine residues in the starting sequence. The total 
number of viable point mutations, X{A), expresses the 
local connectivity of the neutral network. We normal- 
ize it by the total number of neighbors, X,„ax, getting 
the fraction of neutral neighbors, x{A) — X(A)/Xmax £ 
(0,1]. 

To compute a; (A), we have to evaluate the a parame- 
ter for all sequences A' obtained through a point muta- 
tion of sequence A. From Eq. ( p^ we note that the a 
parameter can be obtained from the configuration with 
the highest destabilizing power, i.e. the highest value of 
the energy gap divided by the structural distance from 
the native configuration. These change from sequence 
to sequence, but it is expected not to change very much 
for neighboring sequences. Thus, in order to speed up 
the computation of a(A', C*), instead of considering all 
candidate configurations we consider only the 50 con- 
figurations with the highest destabilizing power (i.e. the 
energy gap divided by the structural distance from the 
native configuration) for sequence A and compute their 
mutated destabilizing power using sequence A'. The a 
parameter is then obtained from the configuration with 
the highest destabilizing power. This procedure could 
slightly overestimate a(A',C*) since not all configura- 
tions are used, but we have checked that the error intro- 
duced in the x value is in all cases below 0.1%. 
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Cross-correlations 

To obtain a deeper insight into the mechanism of neu- 
tral evolution, it is helpful to study the cross-correlations 
between the fraction of neutral neighbors for two given 
residues of a protein of N residues. The analysis starts 
from N individual trajectories {xi{Ai), Xi{A2), . . .} of 
m evolutionary steps for each residue i. We de- 
fine the average fraction of neutral neighbors Xi — 
(1/to) X^fcLi 2;i(Afe) and the corresponding variance erf — 
x1 —Tcf' for each residue i. Then, we calculate the corre- 
lation matrix Cy, where Cij = Cji determines the cross- 
correlations between residue i and j and is defined as 



where k, the number of attempted mutations, is a Pois- 
sonian variable of average value ^t. 

In order to handle the computation, we divide all val- 
ues of a: in M classes, choosing 

Xc SI'S representative value 
of all x's belonging to class c. The number of operations 
needed to evaluate the substitution probability increases 
exponentially with the number of classes M. At the same 
time, the evaluation becomes more and more accurate as 
M increases. We chose M = 6 in our numerical com- 
putations as a reasonable compromise between accuracy 
and rapidity, checking that larger values of M introduce 
only small changes. 



= ^- (19) 

TO ^ — ' CTiCr,- 

fc=l ^ 

Due to the enforced conservation of cysteine, these 
residues require a special treatment; If residue i is cys- 
teine, then dj = Cji = for all j. 

One needs to distinguish three different cases, (i) Cij — 

means that the fraction of neutral neighbors for residues 

1 and j are uncorrelated, (ii) Cij > indicates that the 
fraction of neutral neighbors for residues i and j are cor- 
related, and (iii) Cy < shows that the fraction of neu- 
tral neighbors for residues i and j are anti-correlated. 

Alternatively, one may also study the cross- 
correlations between the fraction of neutral neighbors 
for one given residue and the fraction of neutral neigh- 
bors for the whole protein. Defining the average frac- 
tion of neutral neighbors for the whole protein x = 
(1/to) X^fcLi 2;(Afc) and the corresponding variance — 
x"^ —xP', we calculate the correlation vector Ci which de- 
termines the cross-correlations between the fraction of 
neutral neighbors for residue i and the fraction of neu- 
tral neighbors for the whole protein, 

1 ™ (x^{Ak) - xl) {x(Ak) - x\ 

TO 

Here, again three different cases need to be distinguished, 
(i) Ci = Q indicates that the fractions of neutral neigh- 
bors for residues i and for the whole protein are uncor- 
related, (ii) Ci > means that the fractions of neutral 
neighbors for residues i and for the whole protein are 
correlated, and (iii) < shows that the fractions of 
neutral neighbors for residues i and for the whole protein 
are anti-correlated. 



Substitution process 

Given an evolutionary trajectory {a:(Ai), a;(A2), . . .}, 
the distribution of the number of substitutions taking 
place in a time t can be computed by considering Eq. (||) , 
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